
Conversation

@zikangh (Contributor) commented Oct 7, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR is Part I of implementing SparkMicroBatchStream.getFileChanges() to support Kernel-based DSv2 Delta streaming (M1 milestone).

  • Reads Delta commit range and converts actions to KernelIndexedFile objects with proper indexing and sentinel values.
  • Basic 1-pass commit validation.

Follow-ups include schema evolution support and initial snapshot support (marked TODO(M1) in the code).
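
As context for the indexing and sentinel scheme mentioned above, here is a minimal sketch of what such an indexed-file record could look like. The class shape, field names, and the "null path means sentinel" convention are illustrative assumptions, not the PR's actual KernelIndexedFile; the real sentinel indices come from DeltaSourceOffset.

// Illustrative sketch only -- not the PR's KernelIndexedFile.
public final class IndexedFileSketch {
  private final long version;       // Delta commit version this entry belongs to
  private final long index;         // offset index within the version, as used by DeltaSourceOffset
  private final String addFilePath; // null for the sentinel entries that mark version boundaries

  public IndexedFileSketch(long version, long index, String addFilePath) {
    this.version = version;
    this.index = index;
    this.addFilePath = addFilePath;
  }

  /** Sentinel entries carry no data file; they only delimit a commit version. */
  public boolean isSentinel() {
    return addFilePath == null;
  }

  public long getVersion() { return version; }
  public long getIndex() { return index; }
  public String getAddFilePath() { return addFilePath; }
}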

How was this patch tested?

Parameterized tests verifying parity between DSv1 (DeltaSource) and DSv2 (SparkMicroBatchStream).

Does this PR introduce any user-facing changes?

No

@zikangh mentioned this pull request Oct 7, 2025
@zikangh changed the title from "[kernel dsv2 streaming] Add logic that reads the delta commit log in preparation for streaming read (Part I)" to "[kernel dsv2 streaming] Add logic that reads the delta commit log to determine offsets for streaming read (Part I)" Oct 7, 2025
@zikangh changed the title to "[kernel-spark] Add getFileChanges() to support Kernel-based DSv2 streaming (Part I)" Oct 7, 2025
@zikangh (Contributor, Author) commented Oct 8, 2025

Hi @huan233usc @gengliangwang @jerrypeng, could you please help review this PR?

*
* <p>Indexed: refers to the index in DeltaSourceOffset, assigned by the streaming engine.
*/
public class KernelIndexedFile {
Collaborator:

Nit: just call it IndexedFile? Kernel is just an impl detail.

Contributor:

+1

Contributor (Author):

Done.

* from the first row (rowId=0).
*/
public static long getVersion(ColumnarBatch batch) {
assert batch.getSize() > 0;
Contributor:

let's follow https://github.com/delta-io/delta/blob/master/kernel/kernel-api/src/main/java/io/delta/kernel/internal/actions/RowBackedAction.java#L46 and create a new helper function here

protected int getFieldIndex(String fieldName) {
  int index = row.getSchema().indexOf(fieldName);
  checkArgument(index >= 0, "Field '%s' not found in schema: %s", fieldName, row.getSchema());
  return index;
}

Contributor (Author):

Done. Thank you!
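
For illustration, a minimal sketch of how getVersion could adopt the suggested helper, assuming the Kernel ColumnarBatch/StructType API and the internal Preconditions.checkArgument used by RowBackedAction; the top-level "version" column name is an assumption here, not confirmed by this thread.

import static io.delta.kernel.internal.util.Preconditions.checkArgument;

import io.delta.kernel.data.ColumnarBatch;

// Sketch of the agreed-upon pattern; not the PR's actual StreamingHelper.
public final class StreamingHelperSketch {

  // Mirrors the RowBackedAction helper linked above, but resolves a field on a batch schema.
  private static int getFieldIndex(ColumnarBatch batch, String fieldName) {
    int index = batch.getSchema().indexOf(fieldName);
    checkArgument(index >= 0, "Field '%s' not found in schema: %s", fieldName, batch.getSchema());
    return index;
  }

  // Reads the commit version from the first row (rowId = 0), per the javadoc above.
  public static long getVersion(ColumnarBatch batch) {
    checkArgument(batch.getSize() > 0, "Expected a non-empty batch, got size %s", batch.getSize());
    return batch.getColumnVector(getFieldIndex(batch, "version")).getLong(0);
  }
}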

int id = i * 100 + j;
insertValues.append(String.format("(%d, 'User%d')", id, id));
}
spark.sql(String.format("INSERT INTO %s VALUES %s", testTableName, insertValues.toString()));
Contributor (Author):

I think it might be overkill at this point, especially when there are so many unsupported table types. I agree we should eventually add a test like this. Added a TODO.

continue;
}
long version = StreamingHelper.getVersion(batch);
validateCommit(batch, version, endOffset);
Collaborator:

Should validation happen after processing the previous version?

Contributor (Author):

Done, thanks!

}
CommitRange commitRange = builder.build(engine);
// Required by kernel: perform protocol validation by creating a snapshot at startVersion.
Snapshot startSnapshot =
Contributor:

Why do you need to get a snapshot even if we start reading from a specific delta log version?

Contributor (Author):

It's required by the kernel to fetch actions:

* @param startSnapshot the snapshot for startVersion, required to ensure the table is readable by

Snapshot startSnapshot =
TableManager.loadSnapshot(tablePath).atVersion(startVersion).build(engine);
// TODO(M1): This is not working with ccv2 table
Set<DeltaAction> actionSet = new HashSet<>(Arrays.asList(DeltaAction.ADD, DeltaAction.REMOVE));
Contributor:

Ideally this is a class static variable so it will only be allocated once per query run.

Contributor:

Why do we also need to get the "REMOVE" actions?

Contributor (Author):

Done.

See validateCommit() -- the current behavior of the delta connector is that we fail the pipeline if any commit contains a REMOVE (unless skipDeletes or skipChangeCommits are specified). Streaming jobs are meant to process append-only data

}
long version = StreamingHelper.getVersion(batch);
// TODO(M1): migrate to kernel's commit-level iterator (WIP).
// The current one-pass algorithm assumes REMOVE actions precede ADD actions
Contributor:

Where are you filtering out the "REMOVE" actions?

Contributor (Author):

We throw an error whenever we encounter a REMOVE -- because ETL jobs should process append-only data. We fail explicitly to avoid correctness issues.
In M2, we'll also support ignoreChangedCommits and ignoreDeletes to skip these commits silently.
To properly handle update actions, users would need to use CDF.
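
To make the failure mode concrete, here is a rough, self-contained sketch of the append-only guard described above; the "remove" column name, the two boolean flags, and the exception type are illustrative assumptions, not the connector's actual validateCommit code.

import io.delta.kernel.data.ColumnVector;
import io.delta.kernel.data.ColumnarBatch;

// Rough sketch of the append-only guard described above; names are illustrative assumptions.
public final class RemoveGuardSketch {

  public static void failOnRemoves(
      ColumnarBatch batch, long version, boolean ignoreDeletes, boolean ignoreChangeCommits) {
    if (ignoreDeletes || ignoreChangeCommits) {
      return; // user explicitly opted into skipping commits with deletes/changes
    }
    int removeOrdinal = batch.getSchema().indexOf("remove");
    if (removeOrdinal < 0) {
      return; // batch was read without REMOVE actions
    }
    ColumnVector removeVector = batch.getColumnVector(removeOrdinal);
    for (int rowId = 0; rowId < batch.getSize(); rowId++) {
      if (!removeVector.isNullAt(rowId)) {
        throw new UnsupportedOperationException(
            String.format(
                "Commit version %d contains a REMOVE action. Streaming sources expect "
                    + "append-only tables; failing explicitly to avoid correctness issues.",
                version));
      }
    }
  }
}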

@huan233usc requested a review from tdas October 9, 2025 21:02
}

Row addFileRow = StructRow.fromStructVector(addVector, rowId);
if (addFileRow == null) {
Collaborator:

Should we throw here, given that addVector.isNullAt(rowId) is false?

Contributor (Author):

We call this method even on REMOVE rows in extractIndexedFilesFromBatch

Collaborator:

if this is a remove row, iiuc the method will return on L175? Did I miss something?

Contributor (Author):

Ah yes. You are right. Done.

// A version can be split across multiple batches.
long currentVersion = -1;
long currentIndex = 0;
List<IndexedFile> currentVersionFiles = new ArrayList<>();
Contributor:

It's more performant to use a linked list if we don't actually know the size the list will be.

Contributor (Author):

I don't think so. The per-node overhead of a linked list outweighs the cost of resizing an ArrayList, especially for my use case (addAll(), add(), clear()). We would maybe get a performance benefit if we did a lot of deletes and inserts at the beginning or in the middle, which we don't.


// TODO(#5319): check trackingMetadataChange flag and compare with stream metadata.

result.addAll(dataFiles);
Contributor:

This is not very efficient. "dataFiles" should just be a linked list and you can append and prepend in constant time.

Contributor (Author):

ditto -- I don't think linked lists would help here.

// The current one-pass algorithm assumes REMOVE actions precede ADD actions
// in a commit; we should implement a proper two-pass approach once kernel API is ready.

if (currentVersion != -1 && version != currentVersion) {
Contributor:

This logic here is kind of confusing. All you are trying to do is sandwich the index files between the BASE_INDEX sentinel file and the END_INDEX sentinel file, right? Why not simplify the logic to be:

allIndexedFiles.add(beginSentinelFile)
allIndexedFiles.addAll(allIndexFilesInBatch)
allIndexedFiles.add(endSentinelFile)

Contributor (Author):

We only insert sentinels before and after a version. The code is complex because the kernel breaks up a commit into batches (ColumnarBatch) to avoid overwhelming memory. I reorganized the code a bit to make this clear. Could you take another look?

Contributor:

Chatted offline with @zikangh. I had a concern about why we needed another list, i.e. "currentVersionFiles", to buffer files for this version instead of directly appending to allIndexedFiles. The reason is that in the next PR, she is going to introduce the ability to skip whole commits.
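
To make the batching discussion easier to follow, here is a self-contained toy model of the buffering flow; plain strings stand in for IndexedFile and ColumnarBatch, and all names are illustrative assumptions rather than the PR's code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the per-version buffering discussed above: a commit's files can
// arrive split across several batches, so they are buffered and only flushed
// between a BASE sentinel and an END sentinel once the version changes.
public final class SentinelBufferingSketch {

  private static void flushVersion(long version, List<String> files, List<String> out) {
    out.add("v" + version + ":BASE_SENTINEL");
    out.addAll(files);
    out.add("v" + version + ":END_SENTINEL");
  }

  public static void main(String[] args) {
    // Each entry simulates one ColumnarBatch; version 1 is split across two batches.
    long[] batchVersions = {1L, 1L, 2L};
    List<List<String>> batchFiles =
        Arrays.asList(
            Arrays.asList("v1:a.parquet"),
            Arrays.asList("v1:b.parquet"),
            Arrays.asList("v2:c.parquet"));

    List<String> result = new ArrayList<>();
    List<String> currentVersionFiles = new ArrayList<>();
    long currentVersion = -1;

    for (int i = 0; i < batchVersions.length; i++) {
      long version = batchVersions[i];
      if (currentVersion != -1 && version != currentVersion) {
        // Previous version is complete: emit it sandwiched between sentinels.
        flushVersion(currentVersion, currentVersionFiles, result);
        currentVersionFiles.clear();
      }
      currentVersion = version;
      currentVersionFiles.addAll(batchFiles.get(i));
    }
    if (currentVersion != -1) {
      flushVersion(currentVersion, currentVersionFiles, result);
    }
    // Prints: [v1:BASE_SENTINEL, v1:a.parquet, v1:b.parquet, v1:END_SENTINEL,
    //          v2:BASE_SENTINEL, v2:c.parquet, v2:END_SENTINEL]
    System.out.println(result);
  }
}

Buffering into currentVersionFiles instead of appending straight to the result is what allows a whole commit to be dropped later, which is the follow-up capability mentioned above.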

*/
private List<IndexedFile> extractIndexedFilesFromBatch(
ColumnarBatch batch, long version, long startIndex) {
List<IndexedFile> indexedFiles = new ArrayList<>();
Contributor:

Use a LinkedList.

Contributor (Author):

Same rationale as above -- we are only doing addAll() and add(); ArrayList would be faster and more memory-efficient.

Arguments.of(
0L, BASE_INDEX, isInitialSnapshot, Optional.of(2L), Optional.of(5L), "v0 to v2 id:5"),
Arguments.of(
1L, 5L, isInitialSnapshot, Optional.of(3L), Optional.of(10L), "v1 id:5 to v3 id:10"),
@jerrypeng (Contributor), Oct 11, 2025:

What about to and from END_INDEX?

Contributor (Author):

Done. Thanks!

*/
@ParameterizedTest
@MethodSource("getFileChangesParameters")
public void testGetFileChanges(
Contributor:

Should we also test with other types of actions in the delta log?

Contributor (Author):

We do test REMOVEs & ADDs; I added METADATA too (which will now yield empty commits).

"Index mismatch at index %d: dsv1=%d, dsv2=%d",
i, deltaFile.index(), kernelFile.getIndex()));

String deltaPath = deltaFile.add() != null ? deltaFile.add().path() : null;
Collaborator:

nit: document that deltaFile.add() == null could happen when it is the starting/ending index

Contributor (Author):

Done.

@huan233usc (Collaborator) left a comment:

I think it is good as a starting point.

@zikangh requested a review from jerrypeng October 13, 2025 18:40
*
* <p>Indexed: refers to the index in DeltaSourceOffset, assigned by the streaming engine.
*/
public class IndexedFile {
@tdas (Contributor), Oct 15, 2025:

Is this duplicating any code that is already present in the v1 Delta connector?

Contributor (Author):

Yes, we are duplicating the Scala version of IndexedFile, because the logic is simple enough and the Scala version pulls in DeltaLog-only dependencies (AddFile and RemoveFile). If I created a common interface for AddFile and RemoveFile to bridge Kernel and DeltaLog, the code would be harder to maintain and more error-prone.

However, there are cases where refactoring is a clear win, e.g. DeltaSourceAdmissionBase and AdmissionLimits -- the logic is complex and a small refactor would make them work for both Kernel and DeltaLog classes.

W.r.t. duplicating vs. sharing code, I'm weighing the pros and cons on a case-by-case basis, following the principles I outlined above.

@tdas (Contributor) left a comment:

Very high-level approval. @jerrypeng, you are much closer to streaming than I have been for years; please take a detailed look 🙏

My general high-level question, not a concern, is: how much of this new code duplicates existing v1 connector code? Do we need to duplicate it, or can we refactor the existing code just enough for it to be referenced and reused in the v2 connector?

@zikangh (Contributor, Author) commented Oct 15, 2025

> My general high-level question, not a concern, is: how much of this new code duplicates existing v1 connector code? Do we need to duplicate it, or can we refactor the existing code just enough for it to be referenced and reused in the v2 connector?

Thanks @tdas
IndexedFile and these boundary checks are duplicated in this PR.

@jerrypeng (Contributor):

> My general high-level question, not a concern, is: how much of this new code duplicates existing v1 connector code? Do we need to duplicate it, or can we refactor the existing code just enough for it to be referenced and reused in the v2 connector?

@tdas, we would like to reuse existing v1 code as much as possible to reduce the risk of bugs and ease code reviews. However, due to the differences between the Delta Kernel API and the DeltaLog API (used in DSv1), some of the code may unfortunately have to be duplicated and rewritten.

Yes, this makes detailed reviews and good tests very important.

@huan233usc merged commit 68cf284 into delta-io:master Oct 16, 2025
27 checks passed